Background

As consultants specializing in environmental policy, we are looking to use text mining and sentiment analysis on historical periodicals from 2006 up until now to be able to track the level of support regionally for environmental related issues. We want to be able to secretly track regional support for environmental issues so that we may better decide where to allocate funding to support environmental policy agendas.

We are chosing to focus on articles that are related or deal with the topic of Climate Change, and want to look into the general sentiment of these articles, either positive or negative, relatively. We are using the LexusNexus search engine to do so, and have gather 100 articles from 6 different publications to conduct our analysis.

Approach

The publications chosen were the Chicago Daily Herald, the New York Times, the Washington Post, the Atlanta Journal-Constitution, the Star Tribune (Minneapolis, MN), and the Philadelphia Inquirer.

As evidenced by the spread of news articles, we’ve focused on first the entire U.S’s general sentiment with climate change via the Washington Post and New York Times, then the Southeast via the Atlanta Journal-Constitution. We also looked in the northeast region with the Philadelphia Inquirer, and then more towards the mid-west with the Star Tribune and Chicago Daily Herald.

Loading Data

The first step is to load in all of our data. Before we loaded in the data, we had to clean up the RTF files a bit and make it easier to load in, removing all the redundant and unnecessary coding and underlying aspects associated with an RTF file that appear when loading directly into R. This was accomplished by using an online RTF to TXT file converter . After this cleaning was done, the files can be loaded into the R.

StarTribune = read_lines("StarTribune.txt")
StarTribuneData = tibble(StarTribune)

PhillyInquirer = read_lines("PhillyInquirer.txt")
PhillyInquirerData = tibble(PhillyInquirer)

ATL = read_lines("ATL_Articles.txt")
ATLData = tibble(ATL)

DailyHerald = read_lines("Daily_Herald_climate_articles.txt")
DailyHeraldData = tibble(DailyHerald)

NYT = read_lines("NYT climate articles.txt")
NYTData = tibble(NYT)

WaPo = read_lines("WaPo Articles.txt")
WaPoData = tibble(WaPo)

Unnesting Tokens

The next step is to actually use the unnest_tokens function from the tidytext data set to separate our RTF file of our 100 articles into a dataframe with each row representing a word from the corpus of texts. We also made sure to remove stop words from this corpus to account for them causing issues with our sentiment analysis, and then generated the counts for all the words that were left.

From here, the first chunk of the text file was also removed, which was a numbered list of the 100 articles that were used to create the corpus. Including this would’ve caused a potential skew in the word count as this information was not from the literal article, but just initially informative information.

StarTribuneData = StarTribuneData %>%
  unnest_tokens(word, StarTribune) %>%
  anti_join(stop_words, by="word") %>%
  count(word, sort=TRUE)

head(StarTribuneData, 10)
## # A tibble: 10 x 2
##    word            n
##    <chr>       <int>
##  1 climate       742
##  2 change        559
##  3 minnesota     416
##  4 star          279
##  5 tribune       279
##  6 energy        205
##  7 minneapolis   180
##  8 carbon        172
##  9 water         147
## 10 global        143
PhillyInquirerData = PhillyInquirerData %>%
  unnest_tokens(word, PhillyInquirer) %>%
  anti_join(stop_words, by="word") %>%
  count(word, sort=TRUE)

head(PhillyInquirerData, 10)
## # A tibble: 10 x 2
##    word              n
##    <chr>         <int>
##  1 climate         993
##  2 change          728
##  3 philadelphia    307
##  4 warming         182
##  5 global          179
##  6 people          177
##  7 environmental   165
##  8 inquirer        141
##  9 national        141
## 10 gas             138
ATLData = ATLData %>%
  unnest_tokens(word, ATL) %>%
  anti_join(stop_words, by="word") %>%
  count(word, sort=TRUE)

head(ATLData, 10)
## # A tibble: 10 x 2
##    word             n
##    <chr>        <int>
##  1 climate        788
##  2 change         517
##  3 atlanta        353
##  4 georgia        292
##  5 journal        223
##  6 constitution   220
##  7 warming        191
##  8 energy         186
##  9 global         184
## 10 people         137
DailyHeraldData = DailyHeraldData %>%
  unnest_tokens(word, DailyHerald) %>%
  anti_join(stop_words, by="word") %>%
  count(word, sort=TRUE)

head(DailyHeraldData, 10)
## # A tibble: 10 x 2
##    word        n
##    <chr>   <int>
##  1 climate  1166
##  2 change    825
##  3 people    204
##  4 chicago   182
##  5 global    171
##  6 science   139
##  7 daily     136
##  8 u.s       133
##  9 warming   133
## 10 carbon    131
NYTData = NYTData %>%
  unnest_tokens(word, NYT) %>%
  anti_join(stop_words, by="word") %>%
  count(word, sort=TRUE)

head(NYTData, 10)
## # A tibble: 10 x 2
##    word          n
##    <chr>     <int>
##  1 climate    2264
##  2 change     1401
##  3 global      482
##  4 carbon      452
##  5 times       439
##  6 warming     415
##  7 york        415
##  8 people      389
##  9 world       377
## 10 emissions   301
WaPoData = WaPoData %>%
  unnest_tokens(word, WaPo) %>%
  anti_join(stop_words, by="word") %>%
  count(word, sort=TRUE)

head(WaPoData, 10)
## # A tibble: 10 x 2
##    word           n
##    <chr>      <int>
##  1 climate     1340
##  2 change       930
##  3 carbon       321
##  4 washington   296
##  5 energy       283
##  6 global       257
##  7 warming      247
##  8 post         240
##  9 emissions    214
## 10 people       205

Assigning Sentiments

From here we must decide which sentiment analysis method we’d like to use to move forward. The options are bing, afinn, and nrc. Bing classifies words into negative or positive sentiments, with 6,776 words loaded into this dataset and already included in the tidytext package. afinn and nrc are both from the textdata package, and are also good tables to use for sentiment analysis. afinn utilizes a positive to negative number scale to represent sentiments, with negative numbers being negative sentiment, 0 being neutral, and positive numbers being positive sentiment. nrc uses multiple different classifications to describe the sentiment of a word: trust, fear, negative, sadness, anger, surprise, positive, disgust, joy, and anticipation.

get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,891 more rows

We will create 3 new datasets for each publication that joins (via an inner join) each of the 3 sentimental analysis table options to a publication so that we can analyze the results using all 3 to make the best possible conclusions. This allows all of the words to be assigned a sentiment ‘score’.

Analyzing Sentiments

bing

Here we are joining the bing sentiment dataset to each of our 6 publications.

StarTribune_bing = StarTribuneData %>% 
  inner_join(get_sentiments("bing"))

PhillyInquirer_bing = PhillyInquirerData %>%
  inner_join(get_sentiments("bing"))

ATL_bing = ATLData %>%
  inner_join(get_sentiments("bing"))

DailyHerald_bing = DailyHeraldData %>%
  inner_join(get_sentiments("bing"))

NYT_bing = NYTData %>%
  inner_join(get_sentiments("bing"))

WaPo_bing = WaPoData %>%
  inner_join(get_sentiments("bing"))

Here is a full list of each word from our 6 publications with their assigned negative or positive sentiment.

Now we will move forward in conducting our sentiment analysis via the bing sentiment method.

Quick Comparisons

Below is simple tabling of each of the 6 publications to display their distribution in terms of counts for negative and positive words.

Publication negative positive
Atlanta Journal-Constitution 741 417
Chicago Daily Herald 571 362
New York Times 1108 589
Philadelphia Inquirer 791 423
Star Tribune 634 420
Washington Post 796 473
Visualization

We can better compare the 6 publications through a visualization of their sentiment distribution on a histogram.

Out of the 6 publications we looked into, it’s obvious that the general sentiment is negative for these articles, but more so noticeable with the New York Times, as well as the Washington Post, Atlanta Journal-Constitution, Philadelphia Inquirer, and Washington Post.

However, in thinking about climate change articles and scanning through a few of them before conducting our analysis, we’ve realized that most are reporting about the issue by explaining it as a threat and problem, something we need to focus a lot on because a lot of damage as already been caused and is coming from it. We think for this reason that it’s important to be looking at where there is a heavier skew of negative sentiment, as we believe this would be indicative of a more strong focus on the topic and realizing its daunting impact.

Word Clouds

Now that we’ve been able to draw some conclusions on the general sentiments of our 6 publications, we can compare similarities in word choice against the 6 to see if the general topic of climate change has similar patterns throughout, or if each of our publications are more specialized with what realms of climate change they are targeting and covering more.

Going off of the decision to focus in on the negative sentiments, we believe it’s important to keep in mind the context of the situation that we are conducting analysis for. Climate change is an impending issue and, as such, it makes sense for it to be described in articles in a negative manner. Looking at the word clouds above, which have words color coded with blue being positive and red being negative sentiment, we can see the most prominent words of each category. Words like “support”, “sustainability”, and “protection” are categorized positively, while using words like “issue”, “threat”, “risk”, “damage”, etc. are categorized negatively.

One thing to keep in mind is the fact that “warm” and “warmer” were labeled as positive sentiment, which in relation to climate change should be regarded in a negative manner, as this is a problem caused by it.

In general, after observing these word clouds, we’ve come to the conclusion, as prior, that the negative words seem to be indicative of a publication recognizing the important of climate change and its impacts and negative consequences, and thus would correlate to having higher support for trying to fix this and support policies that do so.

afinn

StarTribune_afinn = StarTribuneData %>% 
  inner_join(get_sentiments("afinn")) %>%
  mutate(sentiment = case_when(
    value < 0 ~ "negative",
    value > 0 ~ "positive"
  ))

PhillyInquirer_afinn = PhillyInquirerData %>%
  inner_join(get_sentiments("afinn"))%>%
  mutate(sentiment = case_when(
    value < 0 ~ "negative",
    value > 0 ~ "positive"
  ))

ATL_afinn = ATLData %>%
  inner_join(get_sentiments("afinn"))%>%
  mutate(sentiment = case_when(
    value < 0 ~ "negative",
    value > 0 ~ "positive"
  ))

DailyHerald_afinn = DailyHeraldData %>%
  inner_join(get_sentiments("afinn"))%>%
  mutate(sentiment = case_when(
    value < 0 ~ "negative",
    value > 0 ~ "positive"
  ))

NYT_afinn = NYTData %>%
  inner_join(get_sentiments("afinn"))%>%
  mutate(sentiment = case_when(
    value < 0 ~ "negative",
    value > 0 ~ "positive"
  ))

WaPo_afinn = WaPoData %>%
  inner_join(get_sentiments("afinn"))%>%
  mutate(sentiment = case_when(
    value < 0 ~ "negative",
    value > 0 ~ "positive"
  ))

Here is a full list of each word from our 6 publications with their assigned numerical ‘sentiment value’.

Now we will move forward in conducting our sentiment analysis via the bing sentiment method.

Quick Comparisons

Below is simple tabling of each of the 6 publications to display their distribution in terms of counts for values of words.

publication -5 -4 -3 -2 -1 1 2 3 4 5
Atlanta Journal-Constitution NA 6 85 278 122 125 171 36 8 NA
Chicago Daily Herald 1 3 70 226 100 110 152 38 9 3
New York Times NA 8 92 383 163 128 217 49 14 2
Philadelphia Inquirer NA 6 93 268 121 122 165 39 8 1
Star Tribune NA 3 57 227 102 111 165 44 9 1
Washington Post NA 5 75 299 129 121 177 38 11 1
Visualization

We can better compare the 6 publications through a visualization of their sentiment distribution on a histogram. We have also added a new column to our data that categorizes a value to be a positive sentiment if it’s positive and negative sentiment if it’s negative, and have also graphed this histogram to better and more easily see the negative and positive distributions again.

Looking at the sentiment values assigned by the afinn sentiment method, where more negative values are more negative sentiments and more positive values are more positive sentiments, we see a similar pattern from the bing data. The general skew is slightly more towards the negative side, however overall seems to be more evenly distributed than bing.

Another thing to note is the most prominent sentiment ‘values’ being -2, -1, 1, and 2, with the range being -5 to 5. This reflects that the words with extreme sentiment one way or another are not used as much as words with more ‘subtle’ sentiments.

Word Clouds

Now that we’ve been able to draw some conclusions on the general sentiments of our 6 publications, we can compare similarities in word choice against the 6 to see if the general topic of climate change has similar patterns throughout, or if each of our publications are more specialized with what realms of climate change they are targeting and covering more.

The word clouds generated from this sentiment method look a bit different from those in bing, however our general conclusion is the same in looking at the words in the cloud. The negative sentiments are more so recognizing the impact of the issue and highlighting the problem, while the blue is seeming to be using more positive ‘buzz words’ per say to appear positively but do not really reflect the idea that they’d necessarily support an environmentalist policy or support efforts to combat climate change.

nrc

StarTribune_nrc = StarTribuneData %>% 
  inner_join(get_sentiments("nrc"))

PhillyInquirer_nrc = PhillyInquirerData %>%
  inner_join(get_sentiments("nrc"))

ATL_nrc = ATLData %>%
  inner_join(get_sentiments("nrc"))

DailyHerald_nrc = DailyHeraldData %>%
  inner_join(get_sentiments("nrc"))

NYT_nrc = NYTData %>%
  inner_join(get_sentiments("nrc"))

WaPo_nrc = WaPoData %>%
  inner_join(get_sentiments("nrc"))

Here is a full list of each word from our 6 publications with their assigned sentiment, with a larger expanse of categories and more detailed categorization.

Now we will move forward in conducting our sentiment analysis via the bing sentiment method.

Quick Comparisons

Below is simple tabling of each of the 6 publications to display their distribution in terms of counts for word sentiments.

publication anger anticipation disgust fear joy negative positive sadness surprise trust
Atlanta Journal-Constitution 279 283 196 362 201 685 733 279 156 452
Chicago Daily Herald 230 265 149 300 188 563 673 237 144 412
New York Times 390 382 282 506 263 1012 961 391 204 551
Philadelphia Inquirer 311 289 218 397 205 733 748 312 158 442
Star Tribune 254 301 170 318 209 625 741 267 150 448
Washington Post 285 313 187 352 207 726 804 302 156 468
Visualization

We can better compare the 6 publications through a visualization of their sentiment distribution on a histogram.

The NRC sentiment is a bit more detailed than the other two have been. Words are separated into multiple categories in addition to just positive and negative sentiment. Here we see that for the most part the bars for negative and positive words are even, if anything a bit higher on the positive side. Thus, we are focusing more on the distribution of words within the other categories to make our conclusions.

Trust is the first category we think should be recognized, and it’s associated with words such as ‘president’, and other government-related terms. We do not think we can make much of a conclusion from higher trust sentiments, given that this could be swayed in either direction of either supporting or not supporting governmental policies, and should be regarded as a neutral term used in both situations. Fear, however, is a more impactful sentiment to be looking at, which is indicative of the articles concerns of the impact of climate change, as well as anger and sadness.

Word Clouds

Now that we’ve been able to draw some conclusions on the general sentiments of our 6 publications, we can compare similarities in word choice against the 6 to see if the general topic of climate change has similar patterns throughout, or if each of our publications are more specialized with what realms of climate change they are targeting and covering more.

The main conclusion to recognize from these word clouds is the very clear and obvious word ‘change’ being front and center. This could be due to its use in pairing with ‘climate change’ and any external use referencing the need for change, which could’ve cause its high frequency. This word is categorized as fear, which is important to note when looking at the histogram plots from above and remember that they may be slightly higher than reality due to this word’s prominent in the corpuses.

Conclusions and Recommendations

Now, moving forward with conclusions and recommendations, we are focusing on looking more into comparing sentiments across the publications so that we can decide where to focus lobbying efforts for more environmental policies.

Conclusions

The New York Times and Washington Post are both national newspapers; for all 3 sentiment methods they’ve reflected a clear country-wide recognition of climate change and its ‘negative’ sentiment. We are now focusing more into the other 4 publications used to see which area of the country to be targeting - northeast, southeast, midwest.

Looking at the bing sentiments, the Philadelphia Inquirer and Atlanta Journal-Constitution have a higher skew towards negative sentiment, while the Chicago Daily Herald and Star Tribune have sentiment frequencies that are closer together.

Looking at the afinn sentiments, we see the same results, with noticably higher generally negative sentiment for those two publications in comparison the the more midwest-regioned ones.

Looking at the nrc sentiments, the Chicago and Minnesota newspapers actually have slightly more positive than negative sentiment words, while the north and south east publications are closer to equal. However, fear, anger, and sadness words all, once again, show higher counts in the Atlanta and Philadelphia papers than the other two.

In total for all 3 sentiments, as evidenced by the word clouds, we recognized that we must take into consideration the context of what we’re analyzing and looking at, the issue being climate change. Publications and articles that reflect higher positive connotations are using words like ‘save’, ‘care’, ‘sustainability’ etc. For the most part, these words are primarily positive ‘buzz words’ that have little substance in terms of really attacking the climate change topic at hand. With this being a surface level, initial exploratory effort, we think that a more in-depth analysis could be done if only words that are positive for this situation are filtered for, like ‘clean’, ‘grant’, ‘solution’, ‘sustainability’, etc. and then a sentiment analysis done again, which takes out the redundant positively connotated words and provides a more robust analysis.

The negatively connotated are blunt and direct, and reflect the severity and direness of climate change. The publications issue out more focused articles that detail climate change issues and recognize the problem. We think these areas would be more responsive to environmental policies given their current and clear recognition and understanding of what’s going on. Often times actually accepting the problem and addressing it as an issue is the more difficult part with making change happen, and these areas have already done so.

In general, it is important to also do a general glance over in terms of what the top words are for both positive and negative connotations to make sure that they are relevant in context of climate change. Looking at words with top frequencies can also help tell from initial glance what the potential focuses are in terms of issues for each region. For example, looking back at the top 10 words for each of the 6 publications, Minnesota has energy, carbon, and water as top words, while warming and gas are top words in Philadelphia and warming and energy in Atlanta. Using this as an initial starting point, we can also try to recognize what issues are of importance to what areas and use this as leverage when lobbying policies. Carbon emissions, climate change, and energy are popular nationwide.

Recommendation

The Atlanta Journal-Constitution and Philadelphia Inquirer have recognizably higher counts of words with negative connotations than the two midwest publications, which we’ve deemed to be of important in this analysis. Thus we recommend that our client look further into these two locations first, gathering and looking into corpuses of articles from other publications from those areas and conducting the same type of analysis, but more detailed. We need to make sure that the Philadelphia Inquirer and Atlanta Journal Constitution did not just happened to be more focused on the direness of climate change than the general consensus of the location of the country they’re in. Comparing each of these two publications with others in their area for similar sentiments will solify that these are ideal locations to put more effort into learning about in terms of their current environmental issues and policies, as we know they would be good locations that would be receptive to environmental policies.

This is not to say that the Chicago and Minnesota publications are not in support of climate change related policies, just that the first two seem to be, from initial analysis, better locations to start focus on. We also recommend, if funds and time allow, to look into the midwest region, however we realize that these areas are relatively lower in negative sentiment.